fix(agent-worker): propagate runId + runJobToken through JobEventSchema #874
PR #871 flipped the `LOBU_SESSION_STORE` default to snapshot mode. PR #865 added a startup assertion that makes the worker throw if snapshot mode is on but `WorkerConfig.runId` is missing. Together those broke every Telegram chat in prod with:

"Snapshot mode (LOBU_SESSION_STORE != 'file') but WorkerConfig.runId is missing — runs-queue dispatch did not stamp runId on the job payload"

The gateway-side `MessageConsumer` correctly sets `data.runId` (line 149) and `data.runJobToken` (line 185) before dispatch. `job-router` writes the full payload to SSE. The worker reads `payload.runId` / `payload.runJobToken` in `payloadToWorkerConfig` (sse-client.ts:925-935).

The missing link was `JobEventSchema`. Its inner `payload` object used a plain `z.object(...)`, which strips unknown keys by default: `runId` and `runJobToken` were silently dropped at `safeParse`, so `payload.runId` was always `undefined` and the assertion fired on every message.

Fix: declare `runId` + `runJobToken` explicitly on the schema, and add `.passthrough()` so future `MessagePayload` fields (`mcpConfig`, `nixConfig`, `egressConfig`, `preApprovedTools`, `exec*`, `organizationId`, `networkConfig`…) don't regress the same way.

Tests:
- New regression test feeds a job event with `runId` + `runJobToken` through `handleEvent` and asserts they reach `handleThreadMessage` (pre-fix: `undefined`; post-fix: preserved).
- New test pins `payloadToWorkerConfig`'s mapping of `runId`/`runJobToken`.
- New test confirms the legacy direct-enqueue path (no `runId`) still threads `undefined` cleanly.
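To make the failure mode concrete, here is a dependency-free TypeScript sketch that mimics the two zod object modes involved: the default strip mode (keys not in the shape are dropped on parse) and `.passthrough()` (unknown keys are kept). `parseObject`, `Mode` and the sample payload values are hypothetical stand-ins, not zod's API or the real `JobEventSchema`, and validation of declared keys is deliberately left out.

```typescript
// Hypothetical, dependency-free model of zod object parsing modes.
// "strip" mirrors z.object(...)'s default: keys not in the shape are dropped.
// "passthrough" mirrors .passthrough(): unknown keys are copied through as-is.
// Only key retention is modeled; zod would also validate the declared keys.
type Mode = "strip" | "passthrough";

function parseObject(
  declaredKeys: readonly string[],
  input: Record<string, unknown>,
  mode: Mode,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(input)) {
    if (declaredKeys.includes(key) || mode === "passthrough") {
      out[key] = value;
    }
  }
  return out;
}

// Illustrative payload as the gateway would stamp it (values are made up).
const wirePayload = {
  message: "hi",
  runId: "run-123",
  runJobToken: "tok-abc",
  mcpConfig: { servers: [] },
};

// Pre-fix: runId/runJobToken are not declared and strip mode is in effect.
const preFix = parseObject(["message"], wirePayload, "strip");

// Post-fix: runId/runJobToken are declared and passthrough keeps the rest.
const postFix = parseObject(
  ["message", "runId", "runJobToken"],
  wirePayload,
  "passthrough",
);

console.log(preFix.runId); // undefined
console.log(postFix.runId, postFix.mcpConfig !== undefined); // run-123 true
```

Declaring the fields explicitly documents the contract, while passthrough is the safety net for fields nobody has added to the schema yet.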
Codecov Report: All modified and coverable lines are covered by tests.
…ects for new runs (#877)

* fix(server): handle action_input JSONB-string shape + write JSONB objects for new runs

Live prod bug, third in the Phase 5 chain. Snapshot mode is the default and the worker correctly POSTs to /worker/transcript/snapshot with the right runId (PR #874), but the gateway's `isRunOwnedByJwtScope` verifier rejects every call with 403 because `runs.action_input` is stored as a JSONB **string** (double-encoded), not a JSONB object. The verifier's `->> 'agentId'` returns NULL on a JSONB string, so the scope comparison fails.

Root cause traced to runs-queue.ts:309: `JSON.stringify(data)` was bound to a `$4::jsonb` parameter, which Postgres ingested as a JSONB string scalar. Fixed by passing the object through postgres-js's `sql.json()` helper so the driver sends a proper JSONB object.

Two-part fix:
- Verifier (transcript-routes.ts): `CASE jsonb_typeof` to handle both shapes. Object rows use direct `->>`; string rows unwrap via `(action_input #>> '{}')::jsonb`. New rows post-fix always take the 'object' branch; legacy in-flight string rows authorize correctly during the deploy crossover window.
- Dispatch (runs-queue.ts): write JSONB objects directly via `sql.json` going forward. New chat_message / task rows store proper objects.

Tests cover both shapes in the verifier and the new dispatch shape.

* chore(submodule): bump owletto to clear drift check
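The verifier's two-shape handling can be mirrored in application code to see why only the legacy rows failed. This TypeScript sketch is a model of what the SQL does on each row shape, not the actual query in transcript-routes.ts; the `ActionInput` type and the `readAgentIdObjectOnly` / `readAgentId` helpers are hypothetical, and for simplicity only string `agentId` values are returned.

```typescript
// Hypothetical model of runs.action_input as read back from Postgres:
// new rows hold a JSON object; legacy rows hold a JSON *string* whose text is
// the JSON of the original object (the double-encoded shape).
type ActionInput = Record<string, unknown> | string;

// Pre-fix behaviour: like a bare `action_input ->> 'agentId'`, which yields
// NULL when action_input is a JSONB string rather than an object.
function readAgentIdObjectOnly(actionInput: ActionInput): string | null {
  if (typeof actionInput === "string") return null;
  const agentId = actionInput.agentId;
  return typeof agentId === "string" ? agentId : null;
}

// Post-fix behaviour: branch on the stored type, like CASE jsonb_typeof(...).
// The string branch unwraps one level, like (action_input #>> '{}')::jsonb.
// JSON.parse throws on malformed text; the sketch does not guard against that.
function readAgentId(actionInput: ActionInput): string | null {
  const unwrapped: unknown =
    typeof actionInput === "string" ? JSON.parse(actionInput) : actionInput;
  if (unwrapped === null || typeof unwrapped !== "object" || Array.isArray(unwrapped)) {
    return null;
  }
  const agentId = (unwrapped as Record<string, unknown>).agentId;
  return typeof agentId === "string" ? agentId : null;
}

const data = { agentId: "agent-1", kind: "chat_message" };
const legacyRow: ActionInput = JSON.stringify(data); // double-encoded shape
const newRow: ActionInput = data; // shape written via sql.json going forward

console.log(readAgentIdObjectOnly(legacyRow)); // null
console.log(readAgentId(legacyRow), readAgentId(newRow)); // agent-1 agent-1
```

Keeping the string branch only matters for rows written before the deploy; once those runs finish, every row takes the object branch.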
Summary
PR #871 flipped the `LOBU_SESSION_STORE` default to snapshot mode. PR #865 added a startup assertion that throws if snapshot mode is on but `WorkerConfig.runId` is missing. Together they broke every Telegram chat in prod, with the worker failing that assertion on every message.

The gateway sets `data.runId` (message-consumer.ts:149) and `data.runJobToken` (line 185) correctly. `job-router` writes the whole payload to SSE. The worker reads `payload.runId` / `payload.runJobToken` in `payloadToWorkerConfig` (sse-client.ts:925-935).

The dropped link was `JobEventSchema`. Its inner `payload` was a plain `z.object(...)`, and zod's default object mode strips unknown keys, so `runId` and `runJobToken` were silently removed at `safeParse`. `payload.runId` therefore always reached the worker as `undefined`, and the assertion fired on every message.

Fix

- Declare `runId` + `runJobToken` explicitly on the schema.
- Add `.passthrough()` so future `MessagePayload` fields (`mcpConfig`, `nixConfig`, `egressConfig`, `preApprovedTools`, `exec*`, `organizationId`, `networkConfig`, ...) don't regress the same way.

Diff is 5 lines of real logic; the rest is the schema + comments.
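On the worker side, the two pieces that consume this payload can be sketched as a mapping in the spirit of `payloadToWorkerConfig` plus the PR #865 startup check. `JobPayload`, `WorkerConfigSketch`, `toWorkerConfig` and `assertRunIdForSnapshot` are hypothetical names and the error text is abbreviated; the condition itself (snapshot mode whenever `LOBU_SESSION_STORE` is not `'file'`, with snapshot as the default) follows the description above.

```typescript
// Hypothetical subset of the job payload; the real MessagePayload has many more fields.
interface JobPayload {
  runId?: string;
  runJobToken?: string;
}

// Hypothetical subset of WorkerConfig.
interface WorkerConfigSketch {
  runId: string | undefined;
  runJobToken: string | undefined;
}

// Copies the fields straight through; on the legacy direct-enqueue path they
// are absent and simply stay undefined.
function toWorkerConfig(payload: JobPayload): WorkerConfigSketch {
  return { runId: payload.runId, runJobToken: payload.runJobToken };
}

// Startup check: any LOBU_SESSION_STORE value other than "file" (including
// unset, now that snapshot is the default) counts as snapshot mode.
function assertRunIdForSnapshot(
  sessionStore: string | undefined,
  config: WorkerConfigSketch,
): void {
  if (sessionStore !== "file" && config.runId === undefined) {
    throw new Error("Snapshot mode but WorkerConfig.runId is missing");
  }
}

const stamped = toWorkerConfig({ runId: "run-123", runJobToken: "tok-abc" });
assertRunIdForSnapshot(undefined, stamped); // ok: runId present

const unstamped = toWorkerConfig({});
assertRunIdForSnapshot("file", unstamped); // ok: file mode does not need a runId

try {
  // What every Telegram message hit while the schema stripped runId.
  assertRunIdForSnapshot(undefined, unstamped);
} catch (err) {
  console.log((err as Error).message);
}
```

The mapping stays permissive and the assertion stays strict, which is why a field dropped upstream surfaces as a startup failure rather than a silently broken transcript.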
Reproducer
Pre-fix: revert the schema change and run the new test:

That `undefined` is exactly the dropped field that fires the prod assertion.

Post-fix: same test suite, schema restored:
Test plan
- New regression test exercises the `handleEvent("job", ...)` parse path with `runId` + `runJobToken` and asserts they reach `handleThreadMessage`: fails pre-fix, passes post-fix.
- New test pins the `payloadToWorkerConfig` mapping into `WorkerConfig`.
- Legacy direct-enqueue path (no `runId`) still threads `undefined` cleanly: backwards-compat preserved.
- `make typecheck` clean.
- `make build-packages` clean.
- … `agent_transcript_snapshot` for that run.